Bag-of-Concepts Document Representation for Textual News Classification

Authors

  • Marcos Mouriño-García
  • Roberto Pérez-Rodríguez
  • Luis E. Anido-Rifón
Abstract

Automatic classification of news articles is a relevant problem because of the large amount of news generated every day; it is therefore crucial that news is classified so that users can access information of interest quickly and effectively. Traditional classification systems represent documents as bags of words (BoW), a representation that is oblivious to two problems of language: synonymy and polysemy. This paper shows the advantages of using a bag-of-concepts (BoC) representation of documents, which tackles synonymy and polysemy, in textual news classification with a Support Vector Machines algorithm. To create the BoC representations, a Wikipedia-based semantic annotator is used. To evaluate the proposal, we used a purpose-built corpus and the Reuters-21578 corpus. Results show that the efficiency of the BoC approach depends strongly on the performance of the semantic annotator in extracting concepts, which in turn depends heavily on the characteristics of the particular corpus, with performance increases of up to 29.65%.
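
The difference between the two representations can be illustrated with a minimal sketch. It is not the authors' pipeline: the tiny `LEXICON`, the cue-word disambiguation rule, and the function names are all hypothetical stand-ins for a Wikipedia-based semantic annotator, and the SVM training step is omitted. It only shows how mapping words to concepts can merge synonyms and separate senses of a polysemous word.

```python
from collections import Counter

# Hypothetical toy lexicon (an assumption, not from the paper):
# surface form -> list of (concept, context cue words).
# A real Wikipedia-based annotator is far more sophisticated.
LEXICON = {
    "car": [("Car", set())],
    "automobile": [("Car", set())],
    "jaguar": [
        ("Jaguar_Cars", {"car", "automobile", "engine"}),
        ("Jaguar_(animal)", {"jungle", "prey", "cat"}),
    ],
}


def tokenize(text):
    """Lowercase and split on whitespace (deliberately naive)."""
    return text.lower().split()


def bag_of_words(text):
    """BoW: count each surface word as its own feature."""
    return Counter(tokenize(text))


def annotate(token, context):
    """Return the concept for a token, or None if it is not in the lexicon.

    Ambiguous tokens are resolved by picking the candidate whose cue words
    overlap most with the document's words; ties go to the first candidate.
    """
    candidates = LEXICON.get(token)
    if not candidates:
        return None
    return max(candidates, key=lambda cand: len(cand[1] & context))[0]


def bag_of_concepts(text):
    """BoC: count concepts instead of surface words."""
    tokens = tokenize(text)
    context = set(tokens)
    concepts = (annotate(tok, context) for tok in tokens)
    return Counter(c for c in concepts if c is not None)


# Synonymy: different words, same concept.
doc_a = "the car engine failed"
doc_b = "the automobile engine failed"

# Polysemy: same word, different concepts depending on context.
doc_c = "the jaguar engine roared"
doc_d = "the jaguar stalked its prey in the jungle"

if __name__ == "__main__":
    for doc in (doc_a, doc_b, doc_c, doc_d):
        print(doc, "->", dict(bag_of_concepts(doc)))
```

In this sketch, `doc_a` and `doc_b` differ as bags of words but share the same bag of concepts, while `doc_c` and `doc_d` share the word "jaguar" but map to different concepts. The resulting concept counts could then be turned into feature vectors for a classifier such as an SVM.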

Similar resources

A New Document Embedding Method for News Classification

Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large amount of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing the semantic concepts of text motivated researchers to use...

"Bag of Events" Approach to Event Coreference Resolution. Supervised Classification of Event Templates

We propose a new robust two-step approach to cross-textual event coreference resolution on news articles. The approach makes explicit use of event and discourse structure, thereby compensating for implications of the Gricean Maxim of quantity. News follows the principle of language economy. Information tends not to be repeated within discourse borders. This phenomenon poses a challenge for mode...

Genre Document Classification Using Flexible Length Phrases

In this paper we investigate the possibility of using phrases of flexible length in genre classification of textual documents, as an extension to the classic bag-of-words document representation in which documents are represented using single words as features. The investigation is conducted on a collection of articles from a document database collected from three different sources representing different genre...

Algorithm for Classification of Textual Documents Represented by Tandem Analysis

This research presents an algorithm for the classification of textual documents that are represented in a space of reduced dimension with respect to the original bag-of-words representation. The algorithm is carried out in two steps: in the first step, classification is conducted for documents represented in the original bag-of-words representation, while in the second step classification is conducted for ...

Journal:
  • Int. J. Comput. Linguistics Appl.

Volume 6

Publication date: 2015